Note to self: Use http://www.stat.cmu.edu/~cshalizi/rmarkdown/ to figure out math symbols!

2. Simulation, linear regression, variance explained, predictions and predictive accuracy

In this exercise you’ll create simulated data from a linear model of a continuous outcome and fit simple regression models to them. Make sure to set a seed prior to starting part (a) to ensure consistent results. Use base R and the rms package.

(a) Using the rnorm() function, create a vector, x, containing 100 observations drawn from a N(0, 1) distribution. This represents a feature, X.

set.seed(1984)
x <- rnorm(100, 0, 1)

(b) Using the rnorm() function, create a vector, eps, containing 100 observations drawn from a N(0, 0.25) distribution, i.e. a normal distribution with mean zero and variance 0.25.

set.seed(1776)
eps <- rnorm(100, 0, 0.25)
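One caveat worth noting: rnorm() is parameterized by the standard deviation, not the variance, so rnorm(100, 0, 0.25) actually draws from N(0, 0.0625); the results below use sd = 0.25 as written. A draw with variance 0.25 would look like this (hypothetical alternative, not used in the results below):

```r
# rnorm() takes the standard deviation as its third argument,
# so variance 0.25 corresponds to sd = sqrt(0.25) = 0.5
set.seed(1776)
eps_alt <- rnorm(100, 0, sqrt(0.25))
var(eps_alt)  # sample variance, should be near 0.25
```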

(c) Using x and eps, generate a vector y according to the model

\(Y = -1 + 0.5X + \epsilon\)

y <- -1 + 0.5 * x + eps

(d) Create a scatterplot displaying the relationship between x and y. Comment on what you observe.

plot(x, y)

The scatterplot shows an approximately linear relationship between x and y, with random scatter around the trend introduced by the noise term eps.

(e) Fit a least squares linear model to predict y using x. Comment on the model obtained. How do \(\hat{\beta_0}\) and \(\hat{\beta_1}\) compare to \(\beta_0\) and \(\beta_1\) ?

g <- rms::ols(y ~ x, x=T, y=T, se.fit=T)
g
## Linear Regression Model
##  
##  rms::ols(formula = y ~ x, x = T, y = T, se.fit = T)
##  
##                 Model Likelihood     Discrimination    
##                    Ratio Test           Indexes        
##  Obs     100    LR chi2    155.61    R2       0.789    
##  sigma0.2725    d.f.            1    R2 adj   0.787    
##  d.f.     98    Pr(> chi2) 0.0000    g        0.599    
##  
##  Residuals
##  
##       Min       1Q   Median       3Q      Max 
##  -0.58362 -0.17994 -0.01112  0.18878  0.76363 
##  
##  
##            Coef    S.E.   t      Pr(>|t|)
##  Intercept -0.9757 0.0273 -35.78 <0.0001 
##  x          0.5570 0.0291  19.15 <0.0001 
## 
anova(g)
##                 Analysis of Variance          Response: y 
## 
##  Factor     d.f. Partial SS MS          F      P     
##  x           1   27.220947  27.22094654 366.54 <.0001
##  REGRESSION  1   27.220947  27.22094654 366.54 <.0001
##  ERROR      98    7.277929   0.07426458
g_dd <- datadist(x, y)
options(datadist="g_dd")
summary(g)
##              Effects              Response : y 
## 
##  Factor Low      High    Diff. Effect  S.E.     Lower 0.95 Upper 0.95
##  x      -0.62964 0.80631 1.436 0.79983 0.041777 0.71692    0.88273
AIC(g)
##     d.f. 
## 27.75532
hist(g$residuals, breaks=14)

plot(x, g$residuals)  # residuals vs. predictor

plot(x, y)
abline(g)

The least-squares line passes through the center of the roughly linear point cloud. The estimates \(\hat{\beta_0} = -0.98\) and \(\hat{\beta_1} = 0.56\) are close to the true values \(\beta_0 = -1\) and \(\beta_1 = 0.5\), as expected given the simulation. By visual inspection, some points sit close to the fitted line, but most are spread evenly around it.

As shown in the residual plot, the deviations from the regression line appear random (no visible pattern), ranging from about −0.6 to 0.8.

(f) Now fit a polynomial regression model that predicts y using x and x^2. Is there evidence that the quadratic term improves the model fit? Explain your answer.

x2 <- x^2
p <- rms::ols(y ~ x + x2, x=T, y=T, se.fit=T)
p
## Linear Regression Model
##  
##  rms::ols(formula = y ~ x + x2, x = T, y = T, se.fit = T)
##  
##                 Model Likelihood     Discrimination    
##                    Ratio Test           Indexes        
##  Obs     100    LR chi2    158.02    R2       0.794    
##  sigma0.2706    d.f.            2    R2 adj   0.790    
##  d.f.     97    Pr(> chi2) 0.0000    g        0.602    
##  
##  Residuals
##  
##      Min      1Q  Median      3Q     Max 
##  -0.5991 -0.1812  0.0140  0.1976  0.7676 
##  
##  
##            Coef    S.E.   t      Pr(>|t|)
##  Intercept -0.9419 0.0349 -27.03 <0.0001 
##  x          0.5655 0.0294  19.23 <0.0001 
##  x2        -0.0388 0.0252  -1.54 0.1270  
## 
anova(p)
##                 Analysis of Variance          Response: y 
## 
##  Factor     d.f. Partial SS MS         F      P     
##  x           1   27.0724905 27.0724905 369.63 <.0001
##  x2          1    0.1735035  0.1735035   2.37 0.127 
##  REGRESSION  2   27.3944500 13.6972250 187.01 <.0001
##  ERROR      97    7.1044253  0.0732415
p_dd <- datadist(x, x2, y)
options(datadist="p_dd")
summary(p)
##              Effects              Response : y 
## 
##  Factor Low      High    Diff.  Effect   S.E.     Lower 0.95 Upper 0.95
##  x      -0.62964 0.80631 1.4360  0.81200 0.042235  0.72817   0.895820  
##  x2      0.13760 1.33190 1.1943 -0.04628 0.030069 -0.10596   0.013398
AIC(p)
##     d.f. 
## 27.34248
#plot(p$residuals, x)
hist(residuals(p), breaks=14)

plot(fitted(p),residuals(p))

plot(x, y)
y_pred <- predict(p, data.frame(x, x2))
lines(sort(x), y_pred[order(x)], col = "red")

The p value for the x^2 term is 0.127, which does not meet the conventional 0.05 threshold, and its coefficient (−0.039) is close to 0, so the quadratic term barely changes the fitted curve. Visually, the quadratic fit is not noticeably better than the straight line, and the residual histograms are similar between the linear and polynomial models. The AIC for the polynomial model (27.34) is marginally lower than for the linear model (27.76), but a difference this small is not meaningful, and AIC values are only interpretable relative to other models fit to the same data, not in absolute terms. Overall, there is little evidence that the quadratic term improves the fit.
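The same conclusion can be reached with a partial F-test in base R (a sketch, assuming x, x2 and y from parts (a)–(c) are in the workspace); it should agree with the x2 row of the rms::ols anova table above:

```r
# Nested-model comparison: does adding the quadratic term help?
fit_lin  <- lm(y ~ x)
fit_quad <- lm(y ~ x + I(x^2))
anova(fit_lin, fit_quad)  # partial F-test for the x^2 term
AIC(fit_lin, fit_quad)    # AIC comparison of the two fits
```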

(g) Repeat (a)–(f) after modifying the data generation process in such a way that there is less noise in the data. Describe your results.

set.seed(1984)
x_ln <- rnorm(100, 0, 1)
set.seed(1776)
eps_ln <- rnorm(100, 0, 0.125)
y_ln <- -1 + 0.5 * x_ln + eps_ln
plot(x_ln, y_ln)

g_ln <- rms::ols(y_ln ~ x_ln, x=T, y=T, se.fit=T)
g_ln
## Linear Regression Model
##  
##  rms::ols(formula = y_ln ~ x_ln, x = T, y = T, se.fit = T)
##  
##                 Model Likelihood     Discrimination    
##                    Ratio Test           Indexes        
##  Obs     100    LR chi2    267.20    R2       0.931    
##  sigma0.1363    d.f.            1    R2 adj   0.930    
##  d.f.     98    Pr(> chi2) 0.0000    g        0.568    
##  
##  Residuals
##  
##        Min        1Q    Median        3Q       Max 
##  -0.291808 -0.089971 -0.005558  0.094390  0.381813 
##  
##  
##            Coef    S.E.   t      Pr(>|t|)
##  Intercept -0.9878 0.0136 -72.46 <0.0001 
##  x_ln       0.5285 0.0145  36.33 <0.0001 
## 
anova(g_ln)
##                 Analysis of Variance          Response: y_ln 
## 
##  Factor     d.f. Partial SS MS          F       P     
##  x_ln        1   24.506556  24.50655555 1319.96 <.0001
##  REGRESSION  1   24.506556  24.50655555 1319.96 <.0001
##  ERROR      98    1.819482   0.01856614
g_ln_dd <- datadist(x_ln, y_ln)
options(datadist="g_ln_dd")
summary(g_ln)
##              Effects              Response : y_ln 
## 
##  Factor Low      High    Diff. Effect S.E.     Lower 0.95 Upper 0.95
##  x_ln   -0.62964 0.80631 1.436 0.7589 0.020888 0.71745    0.80036
print('AIC:')
## [1] "AIC:"
AIC(g_ln)
##      d.f. 
## -110.8741
hist(g_ln$residuals, breaks=14)

plot(x_ln, g_ln$residuals)  # residuals vs. predictor

plot(x_ln, y_ln)
abline(g_ln, col='red')

x2_ln <- x_ln^2
p_ln <- rms::ols(y_ln ~ x_ln + x2_ln, x=T, y=T, se.fit=T)
p_ln
## Linear Regression Model
##  
##  rms::ols(formula = y_ln ~ x_ln + x2_ln, x = T, y = T, se.fit = T)
##  
##                 Model Likelihood     Discrimination    
##                    Ratio Test           Indexes        
##  Obs     100    LR chi2    269.61    R2       0.933    
##  sigma0.1353    d.f.            2    R2 adj   0.931    
##  d.f.     97    Pr(> chi2) 0.0000    g        0.569    
##  
##  Residuals
##  
##        Min        1Q    Median        3Q       Max 
##  -0.299550 -0.090579  0.006998  0.098804  0.383779 
##  
##  
##            Coef    S.E.   t      Pr(>|t|)
##  Intercept -0.9710 0.0174 -55.72 <0.0001 
##  x_ln       0.5327 0.0147  36.23 <0.0001 
##  x2_ln     -0.0194 0.0126  -1.54 0.1270  
## 
anova(p_ln)
##                 Analysis of Variance          Response: y_ln 
## 
##  Factor     d.f. Partial SS  MS          F       P     
##  x_ln        1   24.02860017 24.02860017 1312.29 <.0001
##  x2_ln       1    0.04337587  0.04337587    2.37 0.127 
##  REGRESSION  2   24.54993143 12.27496571  670.38 <.0001
##  ERROR      97    1.77610633  0.01831037
p_ln_dd <- datadist(x_ln, x2_ln, y_ln)
options(datadist="p_ln_dd")
summary(p_ln)
##              Effects              Response : y_ln 
## 
##  Factor Low      High    Diff.  Effect   S.E.     Lower 0.95 Upper 0.95
##  x_ln   -0.62964 0.80631 1.4360  0.76499 0.021117  0.723080  0.8069000 
##  x2_ln   0.13760 1.33190 1.1943 -0.02314 0.015034 -0.052979  0.0066992
print('AIC:')
## [1] "AIC:"
AIC(p_ln)
##     d.f. 
## -111.287
#plot(p$residuals, x)
hist(residuals(p_ln), breaks=14)

plot(fitted(p_ln),residuals(p_ln))

plot(x_ln, y_ln)
y_pred <- predict(p_ln, data.frame(x_ln, x2_ln))
lines(sort(x_ln), y_pred[order(x_ln)], col = "red")

As with the noisier data set above, the p value for the x^2 term (0.127) does not meet the conventional 0.05 threshold. In this case the coefficient for the x^2 term (−0.019) is even closer to 0 than before.

Visual comparison of the fitted lines and residual histograms suggests the linear and polynomial fits are still very similar.

For this less noisy dataset, the AIC for the polynomial model is again slightly lower than for the linear model, though the two remain very close. Both AICs are much lower than for the original noisier dataset, as expected given the smaller residual variance (strictly, AICs are only comparable between models fit to the same response).

(h) Produce two separate plots showing the prediction bands for individual predictions from each of the models. Comment on what you observe about the uncertainty of predictions across the two models. What would you expect if you had built in more (rather than less) noise in the data?

q2df <- data.frame(x, y, x2, eps)
ggplot(q2df, aes(x=x, y=y)) +
  geom_point() + 
  stat_smooth(method = 'lm', formula = y ~ x)

ggplot(q2df, aes(x=x, y=y)) +
  geom_point() + 
  stat_smooth(method = 'lm', formula = y ~ poly(x,2))

q2df_ln <- data.frame(x_ln, y_ln, x2_ln, eps_ln)
ggplot(q2df_ln, aes(x=x_ln, y=y_ln)) +
  geom_point() + 
  stat_smooth(method = 'lm', formula = y ~ x)

ggplot(q2df_ln, aes(x=x_ln, y=y_ln)) +
  geom_point() + 
  stat_smooth(method = 'lm', formula = y ~ poly(x,2))

The shaded band drawn by stat_smooth is the 95% confidence band for the fitted regression curve (the mean), reflecting uncertainty in where the line sits; prediction bands for individual observations would be wider, since they also include the residual noise. In both displays the band is widest at the extremes of x. Comparing the two data sets, the less noisy data set has a visibly narrower band. If the data were more noisy, I would expect both the band around the line and the spread of individual predictions to widen.
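For actual prediction bands (rather than confidence bands for the mean), base R’s predict() can be used; a minimal sketch, assuming x and y from above:

```r
# 95% prediction bands for individual new observations;
# interval = "prediction" adds the residual variance to the band width
fit  <- lm(y ~ x)
grid <- data.frame(x = seq(min(x), max(x), length.out = 100))
pb   <- predict(fit, newdata = grid, interval = "prediction", level = 0.95)
plot(x, y)
lines(grid$x, pb[, "fit"])
lines(grid$x, pb[, "lwr"], lty = 2, col = "red")
lines(grid$x, pb[, "upr"], lty = 2, col = "red")
```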

(i) Familiarize yourself with the rms package in R. Use the validate.ols function to produce an “optimism-corrected” R2 value for each of the two models above. Explain how this differs from producing a bootstrap distribution of the R2 values themselves.

# had to modify call of rms::ols and add x=TRUE and y=TRUE
print('validate g')
## [1] "validate g"
validate(g)
##           index.orig training    test optimism index.corrected  n
## R-square      0.7890   0.7926  0.7860   0.0067          0.7824 40
## MSE           0.0728   0.0697  0.0738  -0.0042          0.0769 40
## g             0.5987   0.5932  0.5987  -0.0056          0.6043 40
## Intercept     0.0000   0.0000 -0.0140   0.0140         -0.0140 40
## Slope         1.0000   1.0000  0.9908   0.0092          0.9908 40
print('validate p')
## [1] "validate p"
validate(p)
##           index.orig training   test optimism index.corrected  n
## R-square      0.7941   0.7916 0.7889   0.0026          0.7914 40
## MSE           0.0710   0.0703 0.0728  -0.0026          0.0736 40
## g             0.6015   0.5951 0.6004  -0.0053          0.6068 40
## Intercept     0.0000   0.0000 0.0108  -0.0108          0.0108 40
## Slope         1.0000   1.0000 1.0111  -0.0111          1.0111 40

According to the rms::validate.ols documentation, “The validate function does resampling validation of a multiple linear regression model… Uses resampling to estimate the optimism in various measures of predictive accuracy which include R^2.” In other words, validate() refits the model in each bootstrap sample, measures how much the refitted model’s apparent R^2 exceeds its performance on the original data (the optimism), and subtracts that average optimism from the apparent R^2 to give a single bias-corrected estimate of future performance. By contrast, a bootstrap distribution of R^2 simply records the apparent R^2 from each resample: it characterizes the sampling variability of the statistic but does not correct for the overfitting that comes from assessing a model on the same data used to fit it.
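The uncorrected bootstrap distribution can be produced by hand (a sketch, assuming x and y from above); its center can be compared with the optimism-corrected R-square reported by validate(g):

```r
# Bootstrap distribution of apparent R^2 (no optimism correction)
set.seed(42)
B  <- 500
r2 <- numeric(B)
n  <- length(y)
for (b in seq_len(B)) {
  idx   <- sample.int(n, replace = TRUE)   # resample rows with replacement
  r2[b] <- summary(lm(y[idx] ~ x[idx]))$r.squared
}
quantile(r2, c(0.025, 0.5, 0.975))  # spread of apparent R^2 across resamples
```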

3. Logistic regression, calibration and discrimination

An extensive data set was used to develop a prognostic model for 6-month unfavorable outcome (d.unfav) after moderate and severe TBI (d.gos = 3, 4). The strongest predictors were age, motor score and pupil reactivity. See “TBI Background Document.pdf” (in Canvas under Files -> Data) for the data dictionary. Note: Truncated systolic blood pressure (d.sysbt) is also included in the dataset but does not appear on the list in the document. The data are from Steyerberg and are in SPSS format. The dataset can be found in Canvas under Files -> Data -> TBI.sav.

(a) Using the rms package in R, fit a logistic regression model (binary outcome) with these 3 variables (age, d.motor and pupil.i) in the US data set (trial=75). First create 5 dummy variables for motor score and 2 for pupil reactivity. Be sure to investigate a potential nonlinear effect of age in the model.

library(foreign)  # provides read.spss()

wd <- getwd()
setwd("..")
parent <- getwd()

tbi <- read.spss(file = paste(parent,'TBI.sav', sep='/'), to.data.frame = TRUE)
## re-encoding from CP1252
print(paste(parent,'TBI.sav', sep='/'))
## [1] "/Users/seth/OneDrive - The University of Colorado Denver/Documents/BIOS_6645/TBI.sav"
setwd(wd)

# filter dataset
tbi <- tbi[tbi$trial == 'Tirilazad US', ]

# convert d.motor column to factor
tbi$d.motor <- as.factor(tbi$d.motor)

LRM fit using R’s built-in factor handling to create the dummy variables

tbi_model <- lrm(d.unfav ~ age + d.motor + pupil.i, data=tbi, x=TRUE, y=TRUE)
tbi_model
## Logistic Regression Model
##  
##  lrm(formula = d.unfav ~ age + d.motor + pupil.i, data = tbi, 
##      x = TRUE, y = TRUE)
##  
##                        Model Likelihood     Discrimination    Rank Discrim.    
##                           Ratio Test           Indexes           Indexes       
##  Obs          1041    LR chi2     262.94    R2       0.304    C       0.784    
##   0            646    d.f.             8    g        1.346    Dxy     0.567    
##   1            395    Pr(> chi2) <0.0001    gr       3.841    gamma   0.569    
##  max |deriv| 2e-11                          gp       0.266    tau-a   0.267    
##                                             Brier    0.179                     
##  
##                             Coef    S.E.   Wald Z Pr(>|Z|)
##  Intercept                  -1.5203 0.7755 -1.96  0.0499  
##  age                         0.0442 0.0060  7.33  <0.0001 
##  d.motor=2                   0.4352 0.7625  0.57  0.5682  
##  d.motor=3                   0.0698 0.7608  0.09  0.9269  
##  d.motor=4                  -0.7078 0.7479 -0.95  0.3440  
##  d.motor=5                  -1.4894 0.7498 -1.99  0.0470  
##  d.motor=6                  -1.9808 0.8507 -2.33  0.0199  
##  pupil.i=one reactive        0.3042 0.2232  1.36  0.1729  
##  pupil.i=no reactive pupils  1.2298 0.1921  6.40  <0.0001 
## 
AIC(tbi_model)
## [1] 1137.076
tbi_dd <- datadist(tbi)
options(datadist="tbi_dd")
summary(tbi_model)
##              Effects              Response : d.unfav 
## 
##  Factor                                     Low High Diff. Effect  
##  age                                        23  41   18     0.79546
##   Odds Ratio                                23  41   18     2.21550
##  d.motor - 1:5                               5   1   NA     1.48940
##   Odds Ratio                                 5   1   NA     4.43430
##  d.motor - 2:5                               5   2   NA     1.92460
##   Odds Ratio                                 5   2   NA     6.85210
##  d.motor - 3:5                               5   3   NA     1.55910
##   Odds Ratio                                 5   3   NA     4.75470
##  d.motor - 4:5                               5   4   NA     0.78155
##   Odds Ratio                                 5   4   NA     2.18490
##  d.motor - 6:5                               5   6   NA    -0.49142
##   Odds Ratio                                 5   6   NA     0.61175
##  pupil.i - one reactive:both reactive        1   2   NA     0.30416
##   Odds Ratio                                 1   2   NA     1.35550
##  pupil.i - no reactive pupils:both reactive  1   3   NA     1.22980
##   Odds Ratio                                 1   3   NA     3.42050
##  S.E.    Lower 0.95 Upper 0.95
##  0.10859  0.582620   1.00830  
##       NA  1.790700   2.74090  
##  0.74983  0.019721   2.95900  
##       NA  1.019900  19.27900  
##  0.24740  1.439700   2.40950  
##       NA  4.219300  11.12800  
##  0.23633  1.095900   2.02230  
##       NA  2.992000   7.55600  
##  0.18096  0.426870   1.13620  
##       NA  1.532400   3.11500  
##  0.43750 -1.348900   0.36605  
##       NA  0.259530   1.44200  
##  0.22316 -0.133220   0.74154  
##       NA  0.875270   2.09920  
##  0.19205  0.853370   1.60620  
##       NA  2.347500   4.98380
plot(anova(tbi_model), what='proportion chisq') # relative importance

plot(Predict(tbi_model, fun=plogis)) # predicted values

#  
# penalty <- pentrace(mod1, penalty=c(0.5,1,2,3,4,6,8,12,16,24), maxit=25)
# mod1_pen <- update(mod1, penalty=penalty$penalty)
# effective.df(mod1_pen)
# mod1_pen

Alternative method with dummy encoding instead of factors

# can use dummies package to create dummy encoded variables
#install.packages('dummies')
library(dummies)
## dummies-1.5.6 provided by Decision Patterns
d_tbi <- dummy.data.frame(tbi, names = c("d.motor", "pupil.i"), sep = "_")

# or model.matrix to create dummy encoded variables
#model.matrix(~tbi$d.motor)

d_model <- lrm(d.unfav ~ age + d.motor_1 + d.motor_2 + d.motor_3 + d.motor_4 + d.motor_6 + `pupil.i_no reactive pupils` + `pupil.i_one reactive`, data=d_tbi, x=TRUE, y=TRUE)
d_model
## Logistic Regression Model
##  
##  lrm(formula = d.unfav ~ age + d.motor_1 + d.motor_2 + d.motor_3 + 
##      d.motor_4 + d.motor_6 + `pupil.i_no reactive pupils` + `pupil.i_one reactive`, 
##      data = d_tbi, x = TRUE, y = TRUE)
##  
##                        Model Likelihood     Discrimination    Rank Discrim.    
##                           Ratio Test           Indexes           Indexes       
##  Obs          1041    LR chi2     262.94    R2       0.304    C       0.784    
##   0            646    d.f.             8    g        1.346    Dxy     0.567    
##   1            395    Pr(> chi2) <0.0001    gr       3.841    gamma   0.569    
##  max |deriv| 1e-11                          gp       0.266    tau-a   0.267    
##                                             Brier    0.179                     
##  
##                             Coef    S.E.   Wald Z Pr(>|Z|)
##  Intercept                  -3.0097 0.2549 -11.81 <0.0001 
##  age                         0.0442 0.0060   7.33 <0.0001 
##  d.motor_1                   1.4894 0.7498   1.99 0.0470  
##  d.motor_2                   1.9246 0.2474   7.78 <0.0001 
##  d.motor_3                   1.5591 0.2363   6.60 <0.0001 
##  d.motor_4                   0.7815 0.1810   4.32 <0.0001 
##  d.motor_6                  -0.4914 0.4375  -1.12 0.2613  
##  pupil.i_no reactive pupils  1.2298 0.1921   6.40 <0.0001 
##  pupil.i_one reactive        0.3042 0.2232   1.36 0.1729  
## 
d_tbi_dd <- datadist(d_tbi)
options(datadist="d_tbi_dd")
summary(d_model)
##              Effects              Response : d.unfav 
## 
##  Factor                     Low High Diff. Effect   S.E.    Lower 0.95
##  age                        23  41   18     0.79546 0.10859  0.582620 
##   Odds Ratio                23  41   18     2.21550      NA  1.790700 
##  d.motor_1                   0   1    1     1.48940 0.74983  0.019721 
##   Odds Ratio                 0   1    1     4.43430      NA  1.019900 
##  d.motor_2                   0   1    1     1.92460 0.24740  1.439700 
##   Odds Ratio                 0   1    1     6.85210      NA  4.219300 
##  d.motor_3                   0   1    1     1.55910 0.23633  1.095900 
##   Odds Ratio                 0   1    1     4.75470      NA  2.992000 
##  d.motor_4                   0   1    1     0.78155 0.18096  0.426870 
##   Odds Ratio                 0   1    1     2.18490      NA  1.532400 
##  d.motor_6                   0   1    1    -0.49142 0.43750 -1.348900 
##   Odds Ratio                 0   1    1     0.61175      NA  0.259530 
##  pupil.i_no reactive pupils  0   1    1     1.22980 0.19205  0.853370 
##   Odds Ratio                 0   1    1     3.42050      NA  2.347500 
##  pupil.i_one reactive        0   1    1     0.30416 0.22316 -0.133220 
##   Odds Ratio                 0   1    1     1.35550      NA  0.875270 
##  Upper 0.95
##   1.00830  
##   2.74090  
##   2.95900  
##  19.27900  
##   2.40950  
##  11.12800  
##   2.02230  
##   7.55600  
##   1.13620  
##   3.11500  
##   0.36605  
##   1.44200  
##   1.60620  
##   4.98380  
##   0.74154  
##   2.09920
plot(anova(d_model), what='proportion chisq') # relative importance

rms::validate(d_model, method="boot", B=500) # bootstrapped validation
## singular information matrix in lrm.fit (rank= 8 ).  Offending variable(s):
## d.motor_1 
## 
## Divergence or singularity in 1 samples
##           index.orig training    test optimism index.corrected   n
## Dxy           0.5672   0.5728  0.5605   0.0123          0.5548 499
## R2            0.3037   0.3122  0.2951   0.0171          0.2866 499
## Intercept     0.0000   0.0000 -0.0154   0.0154         -0.0154 499
## Slope         1.0000   1.0000  0.9601   0.0399          0.9601 499
## Emax          0.0000   0.0000  0.0116   0.0116          0.0116 499
## D             0.2516   0.2600  0.2435   0.0165          0.2351 499
## U            -0.0019  -0.0019  0.0007  -0.0026          0.0007 499
## Q             0.2535   0.2619  0.2428   0.0191          0.2345 499
## B             0.1792   0.1774  0.1810  -0.0036          0.1828 499
## g             1.3459   1.3814  1.3212   0.0602          1.2857 499
## gp            0.2663   0.2692  0.2618   0.0074          0.2589 499
my.calib <- rms::calibrate(d_model, method="boot", B=500) # model calibration
plot(my.calib, las=1)

## 
## n=1041   Mean absolute error=0.005   Mean squared error=6e-05
## 0.9 Quantile of absolute error=0.014
#library(splines)

plot(smooth.spline(tbi$age, tbi$d.unfav, df=4), type="l", ylim=c(0,1))
points(tbi$age, tbi$d.unfav)

ggplot(tbi, aes(x = age, y = d.unfav)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'gam'

Plotting age against unfavorable outcome suggests that age has little to no effect on the risk of an unfavorable outcome from the minimum age up to about 25; beyond age 25 the risk increases approximately linearly with age.
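A more formal check of this nonlinearity, sketched here under the assumption that rms is loaded and tbi is the filtered US data frame, is to refit with a restricted cubic spline on age and inspect the nonlinear term in anova():

```r
# Refit with a 4-knot restricted cubic spline on age; the "Nonlinear"
# row of the anova table tests the spline terms beyond the linear effect
tbi_rcs <- lrm(d.unfav ~ rcs(age, 4) + d.motor + pupil.i,
               data = tbi, x = TRUE, y = TRUE)
anova(tbi_rcs)  # check the "Nonlinear" chi-square for age
AIC(tbi_rcs)    # compare with AIC(tbi_model) above
```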

(b) Obtain the predicted probability according to the model for each patient and plot these indicating values for favorable vs. unfavorable outcomes.

tbi_dd <- datadist(tbi)
options(datadist="tbi_dd")

ggplot(Predict(tbi_model, age, d.motor, pupil.i, fun=plogis))

ggplot(Predict(tbi_model, age, pupil.i, d.motor, fun=plogis))

y_prob <- predict(tbi_model, tbi, type="fitted")
plot(tbi$d.unfav, y_prob)
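Since d.unfav is binary, side-by-side boxplots may separate the two outcome groups more clearly (a sketch, reusing y_prob from above):

```r
# Distribution of predicted probabilities by observed outcome
boxplot(y_prob ~ tbi$d.unfav,
        xlab = "Observed outcome (0 = favorable, 1 = unfavorable)",
        ylab = "Predicted probability of unfavorable outcome")
```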

(c) What is the estimated fraction of unfavorable outcomes in the analyzed subset of patients? Verify that this is as expected for these data.

noquote((c('Fraction of unfavorable outcomes in dataset: ', sum(tbi$d.unfav)/nrow(tbi))))
## [1] Fraction of unfavorable outcomes in dataset: 
## [2] 0.379442843419789
# start with the naive threshold of 0.5: predict unfavorable when the
# fitted probability exceeds the cutoff
y_hat <- ifelse(y_prob > 0.5, 1, 0)
noquote((c('Predicted fraction of unfavorable outcomes in dataset: ', sum(y_hat)/nrow(tbi))))
## [1] Predicted fraction of unfavorable outcomes in dataset: 
## [2] 0.295869356388088
# use ROC curve (roc() from the pROC package) to find a better threshold
roc.tbi <- roc(tbi$d.unfav, y_prob)
## Best cutoff by closest to topleft method
coords.tbi <- coords(roc = roc.tbi, x = "best", input = "threshold", best.method = "closest.topleft")
y_hat <- ifelse(y_prob > coords.tbi[1], 1, 0)
noquote((c('Alternate: Predicted fraction of unfavorable outcomes in dataset: ', sum(y_hat)/nrow(tbi))))
## [1] Alternate: Predicted fraction of unfavorable outcomes in dataset: 
## [2] 0.470701248799231
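Another way to verify that the observed fraction is as expected: for a maximum-likelihood logistic model that includes an intercept, the average fitted probability equals the observed event rate, so this check (reusing y_prob from above) should reproduce the 0.379 computed directly from the data:

```r
# Mean fitted probability; for an ML logistic fit with an intercept this
# equals the observed fraction of unfavorable outcomes
mean(y_prob)
```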

(d) Now, assess the discrimination of the model. Be sure to correct for the optimism of using the same data for fitting and assessing discrimination, using the validate.lrm function.

rms::validate(tbi_model, method="boot", B=500) # bootstrapped validation
##           index.orig training    test optimism index.corrected   n
## Dxy           0.5672   0.5749  0.5605   0.0144          0.5528 500
## R2            0.3037   0.3139  0.2949   0.0190          0.2847 500
## Intercept     0.0000   0.0000 -0.0201   0.0201         -0.0201 500
## Slope         1.0000   1.0000  0.9563   0.0437          0.9563 500
## Emax          0.0000   0.0000  0.0132   0.0132          0.0132 500
## D             0.2516   0.2616  0.2433   0.0183          0.2334 500
## U            -0.0019  -0.0019  0.0005  -0.0024          0.0005 500
## Q             0.2535   0.2635  0.2428   0.0207          0.2328 500
## B             0.1792   0.1771  0.1810  -0.0039          0.1831 500
## g             1.3459   1.3868  1.3205   0.0663          1.2796 500
## gp            0.2663   0.2701  0.2617   0.0084          0.2579 500
my.calib <- rms::calibrate(tbi_model, method="boot", B=500) # model calibration
plot(my.calib, las=1)

## 
## n=1041   Mean absolute error=0.005   Mean squared error=5e-05
## 0.9 Quantile of absolute error=0.012

(e) Summarize the results in (a)-(d).

The three selected predictors give reasonably good discrimination (optimism-corrected Dxy ≈ 0.55, i.e. c ≈ 0.78), and the bootstrap calibration curve shows only minor miscalibration (mean absolute error 0.005). Although age, motor score and pupil reactivity are described as the strongest predictors, the p values from rms::lrm are large for several of the individual dummy terms (d.motor=2, d.motor=3 and pupil.i=one reactive), so the coding of the categorical predictors could be re-examined, or penalized estimation (e.g. pentrace, as sketched in the commented-out code above) could be explored.

(f) What would be the next steps in terms of assessing generalizability of this model to patients in an international trial?

Evaluate the same model on the international observations (trial=74; see (g) below), checking whether discrimination and calibration hold up in that external sample. Alternatively, obtain a new external TBI dataset and evaluate the model on it.
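A sketch of what that external check might look like, under the assumption that an unfiltered copy of the data (called tbi_all here, since the tbi object above was overwritten by the US subset) is available, and that the international arm’s factor label is analogous to ‘Tirilazad US’ (verify with levels(tbi_all$trial)):

```r
# External validation on the international trial (trial = 74);
# tbi_all and the trial label are assumptions -- verify against the data
tbi_int <- tbi_all[tbi_all$trial == "Tirilazad International", ]
tbi_int$d.motor <- factor(tbi_int$d.motor, levels = levels(tbi$d.motor))
p_int <- predict(tbi_model, newdata = tbi_int, type = "fitted")
rms::val.prob(p_int, tbi_int$d.unfav)  # calibration intercept/slope, C-index
```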

(g) EXTRA CREDIT: Carry out the next steps in (f) using the analogous international observations in the dataset (trial=74). Summarize the results for this test dataset.